This report explores a dataset containing expert quality scores and 11 chemical attributes (plus a row index, X) for 1,599 different wines.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
It is tough to get a “feel” for the data from these numbers alone, but they are helpful in interpreting the plots below.
Quality appears to follow a roughly normal distribution with a mean of 5.636. Fractional scores were apparently not allowed, which makes the data highly discrete. No wine scored below 3 or above 8.
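As a rough illustration of how these observations can be checked, the sketch below tabulates the scores and tests whether they are all whole numbers. It uses only the first ten quality values shown in the str() output above as a stand-in for the full column, so the counts are not the real distribution.

```r
# First ten quality scores, copied from the str() output above.
# ASSUMPTION: this is a stand-in for the full 1,599-value column.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Frequency of each score; on the full data these are the bar heights.
score_counts <- table(quality)
print(score_counts)

# TRUE when every score is a whole number (no fractional scores).
all_whole <- all(quality == round(quality))

# Lowest and highest score present.
score_range <- range(quality)
```

On the full data, the same three lines would be expected to show the 3 to 8 range and the concentration of scores around 5 and 6.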
The distribution of most variables appears to follow a rough bell curve, though many of them show a strong positive skew away from zero. The distribution of citric acid amounts is also interesting in that it appears to be slightly bimodal.
I’ll try increasing the number of bins and cutting off outliers to see if we can get any additional insight.
By increasing the number of bins from 30 to 100, we can see that the data are actually much noisier than the original plots show. Many of the variables also start to look discrete. Most likely we are seeing the limits of the measurement precision (e.g. the instruments only resolved values to the nearest whole number, tenth, or hundredth, or the values were rounded).
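One way to probe the precision question is to look at the smallest gap between distinct recorded values. The sketch below does that and also recomputes bin counts at 30 and 100 bins. The helper names (smallest_step, bin_counts) are my own, and the ten values per variable come from the str() output above, so the step estimate from such a small sample may be coarser than the true resolution.

```r
# First ten values of two variables, copied from the str() output above.
alcohol   <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Smallest gap between distinct values: a rough estimate of the
# resolution at which the variable was recorded (hypothetical helper).
smallest_step <- function(x, digits = 6) {
  gaps <- diff(sort(unique(x)))
  round(min(gaps), digits)  # rounding hides floating point noise
}

alcohol_step <- smallest_step(alcohol)    # tenths in this sample
so2_step     <- smallest_step(total_so2)  # whole numbers in this sample

# Counts for a histogram with a given number of equal-width bins,
# computed without drawing anything (hypothetical helper).
bin_counts <- function(x, bins) {
  breaks <- seq(min(x), max(x), length.out = bins + 1)
  hist(x, breaks = breaks, plot = FALSE)$counts
}

coarse <- bin_counts(alcohol, 30)
fine   <- bin_counts(alcohol, 100)

# With finer bins, more of them are empty, which is what makes the
# discreteness visible in the plots.
empty_coarse <- sum(coarse == 0)
empty_fine   <- sum(fine == 0)
```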
The shape of the citric acid distribution becomes even more interesting, with pronounced peaks near 0, 0.25, and 0.5. These are very round numbers - does this indicate a limitation in the measurement method?
I transformed the graphs using log base 10 along the x axis to get a better look at the long tails of the distributions (using 50 bins this time).
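A minimal sketch of a log base 10 histogram with 50 bins follows. It is not necessarily how the plots above were produced. The helper log10_hist is my own name, and dropping zeros is my assumption: citric acid contains exact zeros, and log10(0) is -Inf, so those values cannot be placed on a log axis.

```r
# First ten values, copied from the str() output above.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
sugar  <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Histogram counts on a log10 scale (hypothetical helper).
# ASSUMPTION: non-positive values are dropped and reported separately.
log10_hist <- function(x, bins = 50) {
  positive <- x[x > 0]
  logged   <- log10(positive)
  breaks   <- seq(min(logged), max(logged), length.out = bins + 1)
  h <- hist(logged, breaks = breaks, plot = FALSE)
  list(counts = h$counts, dropped = length(x) - length(positive))
}

citric_log <- log10_hist(citric)
sugar_log  <- log10_hist(sugar)
```

The dropped count is worth reporting next to any log-scaled plot of citric acid, since the peak near 0 disappears from view.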
I wonder how strong the relationships are between these variables and the perceived quality of the wine. To get a better sense for this, let’s graph these variables for the best wines (quality of 7-8) vs. the sub-par wines (quality of 3-5).
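The split into best and sub-par wines can be sketched with subset(). The data frame below holds only the first ten rows from the str() output above (a few columns), and the names wine_sample, best, and sub_par are placeholders of mine. Note that wines scoring exactly 6 fall into neither group.

```r
# First ten rows of three columns, copied from the str() output above.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Best wines: quality of 7 or 8.
best <- subset(wine_sample, quality >= 7)

# Sub-par wines: quality of 3 to 5.
sub_par <- subset(wine_sample, quality <= 5)

# Wines scoring 6 are left out of both groups.
middle <- subset(wine_sample, quality == 6)

# Same kind of per-group summary as the tables in this report.
print(summary(best))
print(summary(sub_par))
```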
## X fixed.acidity volatile.acidity citric.acid
## Min. : 8.0 Min. : 4.900 Min. :0.1200 Min. :0.0000
## 1st Qu.: 482.0 1st Qu.: 7.400 1st Qu.:0.3000 1st Qu.:0.3000
## Median : 939.0 Median : 8.700 Median :0.3700 Median :0.4000
## Mean : 831.7 Mean : 8.847 Mean :0.4055 Mean :0.3765
## 3rd Qu.:1089.0 3rd Qu.:10.100 3rd Qu.:0.4900 3rd Qu.:0.4900
## Max. :1585.0 Max. :15.600 Max. :0.9150 Max. :0.7600
## residual.sugar chlorides free.sulfur.dioxide
## Min. :1.200 Min. :0.01200 Min. : 3.00
## 1st Qu.:2.000 1st Qu.:0.06200 1st Qu.: 6.00
## Median :2.300 Median :0.07300 Median :11.00
## Mean :2.709 Mean :0.07591 Mean :13.98
## 3rd Qu.:2.700 3rd Qu.:0.08500 3rd Qu.:18.00
## Max. :8.900 Max. :0.35800 Max. :54.00
## total.sulfur.dioxide density pH sulphates
## Min. : 7.00 Min. :0.9906 Min. :2.880 Min. :0.3900
## 1st Qu.: 17.00 1st Qu.:0.9947 1st Qu.:3.200 1st Qu.:0.6500
## Median : 27.00 Median :0.9957 Median :3.270 Median :0.7400
## Mean : 34.89 Mean :0.9960 Mean :3.289 Mean :0.7435
## 3rd Qu.: 43.00 3rd Qu.:0.9973 3rd Qu.:3.380 3rd Qu.:0.8200
## Max. :289.00 Max. :1.0032 Max. :3.780 Max. :1.3600
## alcohol quality
## Min. : 9.20 Min. :7.000
## 1st Qu.:10.80 1st Qu.:7.000
## Median :11.60 Median :7.000
## Mean :11.52 Mean :7.083
## 3rd Qu.:12.20 3rd Qu.:7.000
## Max. :14.00 Max. :8.000
Some of the distributions and means for the higher quality wines are noticeably different. For instance, the higher quality wines have a mean citric acid level that is 39% higher than the overall average, with mean volatile acidity and total sulfur dioxide that are 23% and 25% lower than the overall average, respectively. Let’s look at the sub-par wines.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.600 Min. :0.1800 Min. :0.0000
## 1st Qu.: 298.8 1st Qu.: 7.100 1st Qu.:0.4600 1st Qu.:0.0800
## Median : 718.5 Median : 7.800 Median :0.5900 Median :0.2200
## Mean : 750.1 Mean : 8.142 Mean :0.5895 Mean :0.2378
## 3rd Qu.:1227.2 3rd Qu.: 8.900 3rd Qu.:0.6800 3rd Qu.:0.3600
## Max. :1598.0 Max. :15.900 Max. :1.5800 Max. :1.0000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 1.200 Min. :0.03900 Min. : 3.00
## 1st Qu.: 1.900 1st Qu.:0.07400 1st Qu.: 8.00
## Median : 2.200 Median :0.08100 Median :14.00
## Mean : 2.542 Mean :0.09299 Mean :16.57
## 3rd Qu.: 2.600 3rd Qu.:0.09400 3rd Qu.:23.00
## Max. :15.500 Max. :0.61100 Max. :68.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9926 Min. :2.740 Min. :0.3300
## 1st Qu.: 23.75 1st Qu.:0.9961 1st Qu.:3.200 1st Qu.:0.5200
## Median : 45.00 Median :0.9969 Median :3.310 Median :0.5800
## Mean : 54.65 Mean :0.9971 Mean :3.312 Mean :0.6185
## 3rd Qu.: 78.00 3rd Qu.:0.9979 3rd Qu.:3.400 3rd Qu.:0.6500
## Max. :155.00 Max. :1.0031 Max. :3.900 Max. :2.0000
## alcohol quality
## Min. : 8.400 Min. :3.000
## 1st Qu.: 9.400 1st Qu.:5.000
## Median : 9.700 Median :5.000
## Mean : 9.926 Mean :4.902
## 3rd Qu.:10.300 3rd Qu.:5.000
## Max. :14.900 Max. :5.000
Not surprisingly, the sub-par wines often vary from the overall averages in the opposite direction. Their mean volatile acidity is 12% higher, citric acid is 12% lower, and total sulfur dioxide is 18% higher.
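The percentages quoted for both groups follow directly from the means in the summary tables. As a worked check, the sketch below recomputes them from those printed means. The function name pct_diff is my own.

```r
# Means copied from the three summary tables in this report.
overall <- c(volatile.acidity = 0.5278, citric.acid = 0.271,
             total.sulfur.dioxide = 46.47)
best    <- c(volatile.acidity = 0.4055, citric.acid = 0.3765,
             total.sulfur.dioxide = 34.89)
sub_par <- c(volatile.acidity = 0.5895, citric.acid = 0.2378,
             total.sulfur.dioxide = 54.65)

# Percent difference of a group mean relative to a reference mean,
# rounded to the nearest whole percent (hypothetical helper).
pct_diff <- function(group, reference) {
  round(100 * (group - reference) / reference)
}

pct_best    <- pct_diff(best, overall)     # -23, 39, -25
pct_sub_par <- pct_diff(sub_par, overall)  #  12, -12, 18

print(pct_best)
print(pct_sub_par)
```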
There are 1,599 wines in the dataset with 11 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol). All features are numerical rather than categorical. All attributes are continuous except for quality, which only takes discrete whole-number values.
Other observations:
- The median quality is 6, while the mean is 5.636.
- There are no quality scores below 3 or above 8.
- Density has a very narrow range, from about 0.99 to 1.00.
- The alcohol content ranges from 8.4 to 14.9, with a mean of 10.4.
Quality is the most interesting variable, because it determines the value of the wine to the drinker (supposedly). The other variables are primarily interesting in how they affect the quality.
The variables that appear to impact quality the most are volatile acidity, citric acid, and total sulfur dioxide.
No new variables were created. The variables appear to be fairly independent, so it did not make much sense to combine them or derive new ones in this case. The only exception is free sulfur dioxide and total sulfur dioxide, which are clearly related, but the relationship is simply subtractive, so there is no value in creating a new variable for the difference between the two.
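For reference, the subtractive relationship mentioned above amounts to a single line of R. This sketch is only an illustration of what such a derived variable would look like, not something used in the analysis. The name bound_so2 is my own label for the difference, and the values are the first ten rows from the str() output.

```r
# First ten rows, copied from the str() output above.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Difference between total and free sulfur dioxide
# (ASSUMPTION: "bound_so2" is my own name for this quantity).
bound_so2 <- total_so2 - free_so2

# Free sulfur dioxide should never exceed the total.
free_within_total <- all(free_so2 <= total_so2)

print(bound_so2)
```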
Quality is a highly discrete variable, but there is no way to change this after the data has already been collected. Citric acid also has a somewhat unusual distribution, but it doesn’t seem particularly problematic for our analysis.
Looking at this correlation matrix, quality appears to be most strongly correlated with volatile acidity, citric acid, total sulfur dioxide, density, sulphates, and alcohol. Moving forward, I will focus on these variables.
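A correlation matrix like this one, and a ranking of variables by the strength of their correlation with quality, can be sketched with cor(). The ten rows below are copied from the str() output purely so the code runs on its own. Correlations from ten rows mean nothing, so the printed numbers should not be read as results. Density is left out because the str() output shows only five of its values.

```r
# First ten rows, copied from the str() output above.
# ASSUMPTION: a tiny stand-in for the full 1,599-row data frame.
wine_sample <- data.frame(
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation between every pair of columns (cor() default).
cor_matrix <- cor(wine_sample)

# Correlation of each feature with quality, excluding quality itself.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]

# Rank features by absolute correlation, strongest first.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Ranking by absolute value keeps strong negative correlations, such as the one with volatile acidity in the full data, near the top of the list.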
Plotting a correlation matrix with the smaller, more focused, set of variables now.
Now plotting these variables vs. quality.
The discrete nature of the quality scores makes this a bit tougher to interpret, but the correlations can still be plainly seen in the plots.
These plots help us visualize how strongly each variable is associated with the quality of the wine. For instance, you can see that there are far fewer high quality wines with lower alcohol content.
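Because quality only takes a handful of values, a numeric companion to these plots is the mean of each variable within each score. The sketch below uses tapply() on the first ten rows from the str() output, so the group means are illustrative only.

```r
# First ten rows, copied from the str() output above.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Mean alcohol content within each quality score.
mean_alcohol <- tapply(wine_sample$alcohol, wine_sample$quality, mean)

# Mean volatile acidity within each quality score.
mean_volatile <- tapply(wine_sample$volatile.acidity, wine_sample$quality, mean)

print(round(mean_alcohol, 2))
print(round(mean_volatile, 2))
```

A boxplot per score, for example boxplot(alcohol ~ quality, data = wine_sample), shows the same grouping visually.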
Quality generally increases as volatile acidity decreases, which makes sense (acidity and the associated sourness are generally undesirable in wine). However, this same reasoning makes me surprised that higher citric acid actually increases the likely quality of the wine.
Sulfur dioxide and sulphates are used as preservatives in wine, and their relationships to quality do not seem particularly strong, so it probably makes economic sense to use them sparingly. Interestingly though, they have opposite relationships to quality. Perhaps this means that sulphates are a more desirable preservative, while sulfur dioxide is viewed as more of a contaminant. I wonder how they relate to one another, and if there is a cost difference or something in the creation process that would cause one to be more prevalent than the other.
Alcohol is positively correlated with quality, while density has a weaker negative correlation with quality. This makes intuitive sense because alcohol is less dense than water, so as the alcohol content increases, one would expect the density to decrease.
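The intuition can be put into rough numbers with an ideal-mixing estimate. This is a back-of-the-envelope sketch under several assumptions of mine: that the alcohol column is percent by volume, that wine is only ethanol and water, and that volumes add linearly (real ethanol and water mixtures contract slightly, and sugars and other solids raise the density). The densities used are the approximate values for pure ethanol and water near 20 °C.

```r
# Approximate densities in g/mL near 20 degrees C.
rho_ethanol <- 0.789
rho_water   <- 0.998

# Ideal-mixing density for a given alcohol percentage (hypothetical helper).
# ASSUMPTIONS: alcohol is % by volume; only ethanol and water; volumes add.
ideal_mix_density <- function(abv_percent) {
  v <- abv_percent / 100
  v * rho_ethanol + (1 - v) * rho_water
}

# Lowest and highest alcohol values from the summary table.
d_low  <- ideal_mix_density(8.4)
d_high <- ideal_mix_density(14.9)

print(c(low_alcohol = d_low, high_alcohol = d_high))
```

The estimate comes out below the densities actually observed in the data, which fits the idea that dissolved solids push density up, but the direction of the effect matches: more alcohol, lower density.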
There is a strong negative correlation between volatile acidity and citric acid, which does not make intuitive sense to me without knowing more. There is also the negative correlation between alcohol and density, which I noted above.
Quality is most highly correlated with alcohol content.
Now let’s create a view that combines the two variables that are most highly correlated with quality.
This graph clearly shows a clump of lower-quality wines with low alcohol and high volatile acidity. It also shows that high quality wines generally have higher alcohol content and lower volatile acidity.
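A combined view of this kind can be sketched in base R by placing alcohol and volatile acidity on the axes and colouring each point by its quality score. This is not necessarily how the graph above was drawn. The data are the first ten rows from the str() output, and the colour palette is my own choice.

```r
# First ten rows, copied from the str() output above.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One colour per distinct quality score.
score_levels <- sort(unique(wine_sample$quality))
palette_cols <- hcl.colors(length(score_levels), palette = "viridis")
point_cols   <- palette_cols[match(wine_sample$quality, score_levels)]

# Draw to a null PDF device so the sketch runs without opening a window;
# remove pdf(NULL) and dev.off() to see the plot interactively.
pdf(NULL)
plot(wine_sample$alcohol, wine_sample$volatile.acidity,
     col = point_cols, pch = 19,
     xlab = "alcohol", ylab = "volatile acidity")
legend("topright", legend = score_levels, col = palette_cols,
       pch = 19, title = "quality")
invisible(dev.off())
```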
Alternately, we could switch the alcohol and volatile acidity variables to see if this plot provides any additional insights.
This plot reinforces the observations from the one above, though there are certainly some interesting outliers. All of the relationships observed so far also appear to be linear in nature, though the highly discrete nature of the quality scores makes it much more difficult to tell for certain. Therefore, transforming the axes with other scales (log, square root, etc.) does not seem to be appropriate.
As noted above, the strongest relationships were those between quality and alcohol (positive correlation), and quality and volatile acidity (negative correlation). Because of the physical relationship between alcohol and density (alcohol is less dense than water), there is also a notable negative correlation between those two variables.
I was surprised to see that two different agents that are listed as wine preservatives (according to my quick research online) had opposite relationships with wine quality. That is, one had a positive correlation while the other had a negative correlation.
Wine scores are only awarded in whole numbers, making the data highly discrete. The scores also only range from 3 to 8, with the vast majority receiving 5’s or 6’s.
Histograms of the two variables most highly correlated with wine scores. You can see that alcohol is clearly skewed in the positive direction, while volatile acidity resembles more of a centered bell curve, though with some interesting gaps.
This plot shows wine scores plotted against their two most highly correlated variables, alcohol and volatile acidity. Higher quality wines clearly tend to have higher alcohol content and lower volatile acidity, though there are certainly some visible outliers.
The fact that higher acidity leads to lower wine scores is certainly not surprising, but I did not previously understand the different types of acidity (fixed, volatile, citric), and I will have to do more research to understand why only one measure of acidity has a strong impact on quality. I am also surprised that some of the other factors did not have a more significant impact (residual sugar, pH, chlorides, etc.).
I was also somewhat surprised to see the strongest correlation being that between quality and alcohol content. In fact, I might have originally guessed that higher alcohol levels would overpower the taste of the wine, resulting in lower scores. Clearly, this is not the case. This makes me wonder if there is an upper limit, past which the alcohol content would actually make the scores go down. For instance, there are fortified red wines and dessert red wines that can go beyond the alcohol levels in this data set, though they may be considered a different class of wine entirely, so I’m not sure that the same scoring system could be applied anyway. Additional testing with data beyond the current ranges would help to understand the relationships more thoroughly.
It would also help to implement a more precise scoring system for future studies. The highly discrete nature of the scores makes it difficult to study the exact nature of the relationships and the shapes of the plots. Ideally future studies will use a scoring system that produces scores of a more continuous nature. Still, with the data we have, we were able to identify some very clear correlations between the few key variables discussed above.